Red Wine Exploration by Darui Zhang

What properties make a great red wine great? In this project we try to answer that question by exploring the red wine data set.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Univariate Plots Section

Feature Names and Summary

This red wine data set contains 1,599 observations with 11 variables on the chemical properties of the wine, plus an index column (X) and a quality score.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
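A listing like the one above is what `names()` and `summary()` print for a data frame. The following is a minimal sketch, assuming the data is read from a CSV; the inline rows are invented toy values (not taken from the data set) and `redwine_toy` is a hypothetical name, used only so the snippet runs on its own.

```r
# Toy stand-in for the real file: the values below are invented for illustration.
csv_text <- "X,volatile.acidity,alcohol,quality
1,0.70,9.4,5
2,0.88,9.8,5
3,0.28,11.2,6
4,0.36,12.5,7"

redwine_toy <- read.csv(text = csv_text)

# Column names, dimensions, and per-column summaries (min, quartiles, mean, max)
print(names(redwine_toy))
print(dim(redwine_toy))
print(summary(redwine_toy))
```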

Quality Distribution

The wine quality grade is a discrete number ranging from 3 to 8, with a median of 6.
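Because the grade is discrete, a frequency table is a natural first summary. The sketch below uses a short invented vector of grades (not the real data) to show how the counts, range, and median can be obtained.

```r
# Hypothetical quality scores, invented for illustration only
quality <- c(5, 6, 5, 7, 6, 5, 4, 6, 8, 3, 5, 6)

counts <- table(quality)   # number of wines at each grade
print(counts)
print(range(quality))      # lowest and highest grade
print(median(quality))
```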

Distribution of Other Chemical Properties

## Warning: position_stack requires constant width: output may be incorrect

Univariate Analysis

Some observations on the distributions of the chemical properties can be made:

  • Normal: Volatile acidity, Density, pH

  • Positively Skewed: Fixed acidity, Citric acid, Free sulfur dioxide, Total sulfur dioxide, Sulphates, Alcohol

  • Long Tail: Residual sugar, Chlorides

Rescale Variable

The skewed and long-tailed data can be transformed toward a more normal distribution by taking the square root or the logarithm. Taking sulphates as an example, we compare the original, square root, and log of the feature.

Both the square root and the log transform move the feature toward a normal distribution. Of the two, the log-scaled feature is closer to normal.
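One way to make this comparison numeric is to look at the sample skewness before and after each transform. The sketch below is an illustration on simulated right-skewed data, not on the real sulphates column; the `skewness` helper is a hypothetical function defined here, not part of base R.

```r
set.seed(42)
# Simulated right-skewed variable standing in for sulphates (not the real data)
x <- rlnorm(1000, meanlog = -0.5, sdlog = 0.5)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skews <- c(original = skewness(x),
           sqrt     = skewness(sqrt(x)),
           log10    = skewness(log10(x)))
print(round(skews, 3))
```

A skewness closer to zero indicates a more symmetric distribution, so the transform with the smallest absolute value is the one closest to normal in this respect.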

Bivariate Plots Section

Bivariate Plots Selection

A plot matrix was used to get a first look at the data. We are interested in the correlation between wine quality and each chemical property.

The top 4 factors correlated with wine quality (absolute correlation greater than 0.2) are:

Property r-value
alcohol 0.476
volatile.acidity -0.391
sulphates 0.251
citric.acid 0.226
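A ranking like this can be produced by correlating every property with quality and sorting by absolute value. The sketch below runs on simulated stand-in data with arbitrary coefficients, so its numbers will not match the table above.

```r
set.seed(1)
n <- 200
# Simulated stand-in data; the coefficients here are arbitrary, not fitted values
toy <- data.frame(
  alcohol          = rnorm(n, 10.4, 1.0),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  residual.sugar   = rnorm(n, 2.5, 1.4)
)
toy$quality <- round(5.6 + 0.4 * (toy$alcohol - 10.4) -
                     1.5 * (toy$volatile.acidity - 0.53) + rnorm(n, 0, 0.7))

# Pearson correlation of every property with quality
props <- setdiff(names(toy), "quality")
r <- sapply(toy[props], function(col) cor(col, toy$quality))

# Rank by absolute value so strong negative correlations are not overlooked
r_sorted <- r[order(abs(r), decreasing = TRUE)]
print(round(r_sorted, 3))
```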

Bivariate Analysis

Alcohol content has the strongest correlation with wine quality. The scatter plot of alcohol and wine quality is shown below.

The original plot looks overplotted, so we add transparency (an alpha value) and lines for the 0.1, 0.5, and 0.9 quantiles to show the general trend.

In this plot, the trend of increasing wine quality with increasing alcohol content can be clearly observed.
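Quantile lines of this kind summarize one variable within slices of the other. The sketch below is one possible reading, assuming quality is summarized within alcohol values rounded to the nearest whole percent; the data is simulated and the binning choice is an assumption, not necessarily what the original plot used.

```r
set.seed(7)
n <- 300
# Simulated stand-in data (not the real wines)
alcohol <- round(runif(n, 8.5, 14), 1)
quality <- pmin(8, pmax(3, round(2 + 0.35 * alcohol + rnorm(n, 0, 0.8))))

# Group wines by alcohol rounded to the nearest whole percent
alcohol_bin <- round(alcohol)

# 0.1, 0.5 and 0.9 quantiles of quality within each bin (one column per bin)
q <- sapply(split(quality, alcohol_bin), quantile, probs = c(0.1, 0.5, 0.9))
print(q)
```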

Transforming Wine Quality into Categorical Data

Since wine quality takes discrete values, we can transform it from numerical data to categorical data so that box plots can be used to represent it.

Higher quality wines generally have higher alcohol content.
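The conversion to categorical data can be sketched with `factor()`. The values below are invented for illustration, and `quality.factor` is a hypothetical column name; `boxplot()` is called with `plot = FALSE` so that it only returns the per-category statistics.

```r
# Invented toy values for illustration
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 11.8, 12.5, 12.9),
  quality = c(5, 5, 5, 6, 6, 6, 7, 7)
)

# Treat the integer grade as an ordered category
toy$quality.factor <- factor(toy$quality, ordered = TRUE)

# With plot = FALSE, boxplot() returns the box statistics per category
# (whisker ends, hinges, and median) instead of drawing
bp <- boxplot(alcohol ~ quality.factor, data = toy, plot = FALSE)
print(bp$names)
print(bp$stats)
```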

A similar analysis was done for the 3 other factors: volatile acidity, sulphates, and citric acid.

Distribution Analysis

In this analysis, we check whether the distributions of the chemical properties differ across wine quality grades.

Note that since the number of wines in each quality grade is not equal, the distributions of the higher and lower grades are hard to see.

A normalized plot is shown below.
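Normalizing here means dividing the counts within each grade by that grade's total, so that grades with few wines are drawn on the same scale as grades with many. A minimal sketch on invented values:

```r
# Invented toy values: many mid-grade wines, few high-grade ones
quality <- c(rep(5, 8), rep(6, 6), rep(7, 2))
alcohol <- c(9.2, 9.4, 9.5, 9.8, 10.1, 10.4, 10.9, 11.3,
             9.6, 10.2, 10.8, 11.1, 11.5, 12.0,
             11.8, 12.6)

# Bin alcohol, then count wines per (bin, grade)
bins   <- cut(alcohol, breaks = c(9, 10, 11, 12, 13))
counts <- table(bins, quality)

# Divide each column by its total so every grade sums to 1
shares <- prop.table(counts, margin = 2)
print(round(shares, 2))
```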

The plot looks a little busy. We group the grades in pairs: grades 3 and 4 as “Low”, grades 5 and 6 as “Medium”, and grades 7 and 8 as “High”, and plot again.

The new plot looks cleaner.
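The grouping into Low, Medium, and High can be sketched with `cut()`. The variable name `quality.group` is hypothetical, and the break points are chosen to reproduce the pairs described above.

```r
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are (2,4], (4,6], (6,8] because cut() closes on the right by default
quality.group <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("Low", "Medium", "High"))
print(data.frame(quality, quality.group))
```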

A similar analysis was done for the 3 other factors: volatile acidity, sulphates, and citric acid.

As noted in the univariate analysis, the sulphates data is skewed, so we tried both the original and the log scale of the feature.

The log scaled feature looks better.

Median Value of Each Chemical Property

Correlation Between Features

There is an interesting correlation between two of the main features: volatile acidity and citric acid.

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$volatile.acidity and redwine$citric.acid
## t = -26.4891, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957
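The t statistic and the confidence interval in this output follow directly from the sample correlation and the sample size. As a check, the sketch below recomputes them from the reported values, using the standard t formula and the Fisher z-transformation.

```r
r <- -0.5524957   # sample correlation reported above
n <- 1599         # number of observations (df = n - 2 = 1597)

# t statistic for H0: true correlation is 0
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 4))
print(round(ci, 4))
```

Both results agree with the printed output to the digits shown, which confirms that the interval excludes zero by a wide margin.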

Multivariate Plots Section

Main Chemical Property vs Wine Quality

With color, we can add another dimension to the plot. There are 4 main features; alcohol and volatile acidity are the two most strongly correlated with wine quality.

The figure looks overplotted, since wine quality takes discrete values. We use a jitter plot to alleviate this problem.
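Jittering adds a small random offset to each point so that identical discrete values no longer sit on top of each other. A minimal sketch with base R's `jitter()` on invented grades; the amount of 0.2 is an arbitrary choice for illustration.

```r
set.seed(3)
quality <- c(5, 5, 5, 6, 6, 7)   # invented toy grades

# Shift each point by a uniform random offset in [-0.2, 0.2]
quality.jittered <- jitter(quality, amount = 0.2)
print(round(quality.jittered, 2))
```

Keeping the offset well below 0.5 ensures that points from neighboring grades do not overlap.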

We can see that higher quality wines have higher alcohol and lower volatile acidity.

Add Another Feature

Now we add a third feature, the log scale of sulphates, and use facets to show the wine grade.

We can see that higher quality wines have higher alcohol (x-axis), lower volatile acidity (y-axis), and higher sulphates.

Main Chemical Properties vs Wine Quality

Since we can visualize only 3 dimensions at a time, including wine quality, two graphs are needed to show the 4 main chemical properties.

The same trend in the effect of alcohol and volatile acidity on wine quality can be observed.

We can see that higher quality wines have higher sulphates (x-axis) and higher citric acid (y-axis).

Linear Multivariable Model

A multivariable linear model was created to predict wine quality from the chemical properties.

Features were added incrementally, roughly in order of the strength of their correlation with wine quality.

## 
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = redwine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = redwine)
## m3: lm(formula = quality ~ volatile.acidity + alcohol + sulphates, 
##     data = redwine)
## m4: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid, data = redwine)
## m5: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides, data = redwine)
## m6: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides + total.sulfur.dioxide, data = redwine)
## m7: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides + total.sulfur.dioxide + density, 
##     data = redwine)
## 
## ==================================================================================================
##                           m1         m2         m3         m4         m5         m6         m7    
## --------------------------------------------------------------------------------------------------
## (Intercept)            6.566***   3.095***   2.611***   2.646***   2.769***   2.985***   -0.953   
##                       (0.058)    (0.184)    (0.196)    (0.201)    (0.202)    (0.206)    (11.990)  
## volatile.acidity      -1.761***  -1.384***  -1.221***  -1.265***  -1.155***  -1.104***   -1.114***
##                       (0.104)    (0.095)    (0.097)    (0.113)    (0.115)    (0.115)     (0.120)  
## alcohol                           0.314***   0.309***   0.309***   0.292***   0.276***    0.280***
##                                  (0.016)    (0.016)    (0.016)    (0.016)    (0.017)     (0.020)  
## sulphates                                    0.679***   0.696***   0.871***   0.908***    0.903***
##                                             (0.101)    (0.103)    (0.111)    (0.111)     (0.112)  
## citric.acid                                            -0.079      0.021      0.065       0.044   
##                                                        (0.104)    (0.106)    (0.106)     (0.124)  
## chlorides                                                         -1.663***  -1.763***   -1.747***
##                                                                   (0.405)    (0.403)     (0.406)  
## total.sulfur.dioxide                                                         -0.002***   -0.002***
##                                                                              (0.001)     (0.001)  
## density                                                                                   3.923   
##                                                                                         (11.944)  
## --------------------------------------------------------------------------------------------------
## R-squared                 0.153      0.317      0.336      0.336      0.343      0.352      0.352 
## adj. R-squared            0.152      0.316      0.335      0.334      0.341      0.349      0.349 
## sigma                     0.744      0.668      0.659      0.659      0.656      0.651      0.652 
## F                       287.444    370.379    268.912    201.777    166.407    143.910    123.298 
## p                         0.000      0.000      0.000      0.000      0.000      0.000      0.000 
## Log-likelihood        -1794.312  -1621.814  -1599.384  -1599.093  -1590.662  -1580.192  -1580.138 
## Deviance                883.198    711.796    692.105    691.852    684.595    675.689    675.643 
## AIC                    3594.624   3251.628   3208.768   3210.186   3195.324   3176.384   3178.276 
## BIC                    3610.756   3273.136   3235.654   3242.448   3232.964   3219.401   3226.670 
## N                      1599       1599       1599       1599       1599       1599       1599     
## ==================================================================================================
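Nested models like m1 to m7 can be built by adding one term at a time with `update()`. The sketch below does this for three predictors on simulated stand-in data with arbitrary coefficients, so its numbers are not comparable to the table above.

```r
set.seed(11)
n <- 300
# Simulated stand-in data; the generating coefficients are arbitrary
toy <- data.frame(
  volatile.acidity = rnorm(n, 0.53, 0.18),
  alcohol          = rnorm(n, 10.4, 1.0),
  sulphates        = rnorm(n, 0.66, 0.17)
)
toy$quality <- 3 - 1.1 * toy$volatile.acidity + 0.3 * toy$alcohol +
  0.9 * toy$sulphates + rnorm(n, 0, 0.65)

# Each model adds one predictor to the previous one
m1 <- lm(quality ~ volatile.acidity, data = toy)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + sulphates)

models <- list(m1 = m1, m2 = m2, m3 = m3)
comparison <- data.frame(
  r.squared = sapply(models, function(m) summary(m)$r.squared),
  AIC       = sapply(models, AIC)
)
print(round(comparison, 3))
```

R-squared can never decrease when a term is added to a nested least-squares model, which is why a penalized criterion such as AIC is more useful for deciding when to stop.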

The model with 6 features has the lowest AIC (Akaike information criterion), 3176.384. Adding density as a seventh feature raises the AIC to 3178.276. The intercept also changes dramatically (from 2.985 to -0.953, with a standard error of 11.990), which is a sign of overfitting.
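The AIC and BIC rows of the table can be reproduced from the log-likelihood row: AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik, where k counts the estimated parameters (the intercept, the slopes, and the residual variance). A short check for m6 and m7, using the values printed in the table:

```r
n <- 1599
loglik <- c(m6 = -1580.192, m7 = -1580.138)   # from the table above
k      <- c(m6 = 8, m7 = 9)   # intercept + slopes + residual variance

aic <- 2 * k - 2 * loglik
bic <- log(n) * k - 2 * loglik
print(round(rbind(aic, bic), 3))
```

Adding density improves the log-likelihood by only 0.054, which is not enough to offset the penalty of 2 for the extra parameter, so the AIC rises.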

The model can be described as: wine_quality = 2.985 + 0.276 * alcohol - 1.104 * volatile.acidity + 0.908 * sulphates + 0.065 * citric.acid - 1.763 * chlorides - 0.002 * total.sulfur.dioxide
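As a worked example, the m6 coefficients from the table can be applied to a hypothetical wine whose properties all sit at the medians reported in the summary output. The coefficients are used as printed (rounded to three decimals), so the result is approximate.

```r
# m6 coefficients as printed (rounded to three decimals) in the table above
coefs <- c(intercept = 2.985, alcohol = 0.276, volatile.acidity = -1.104,
           sulphates = 0.908, citric.acid = 0.065, chlorides = -1.763,
           total.sulfur.dioxide = -0.002)

# Median of each predictor from the summary output
medians <- c(alcohol = 10.2, volatile.acidity = 0.52, sulphates = 0.62,
             citric.acid = 0.26, chlorides = 0.079, total.sulfur.dioxide = 38)

# Intercept plus the sum of coefficient * value, matched by name
predicted <- coefs[["intercept"]] + sum(coefs[names(medians)] * medians)
print(round(predicted, 3))
```

The predicted grade is about 5.59, close to the mean quality of 5.636 in the data.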


Final Plots and Summary

Plot One

Description One

The median value of each chemical property at each wine quality grade is shown. The values are normalized by the maximum value so that they all range from 0 to 1. Features with monotonically increasing or decreasing trends, such as volatile acidity, have a higher correlation with wine quality. Features that are flat or not monotonic, such as density and free sulfur dioxide, have less correlation with wine quality.
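The normalization described here can be sketched in two steps: take the median of each property per grade, then divide each property by its largest median. The values below are invented for illustration.

```r
# Invented toy values for illustration
toy <- data.frame(
  quality          = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol          = c(9.4, 9.8, 10.0, 10.2, 10.5, 11.0, 11.4, 11.8, 12.5),
  volatile.acidity = c(0.70, 0.62, 0.58, 0.55, 0.50, 0.46, 0.40, 0.36, 0.30)
)

# Median of every property within each quality grade
medians <- aggregate(. ~ quality, data = toy, FUN = median)

# Divide each property by its largest median so the maximum becomes 1
props <- setdiff(names(medians), "quality")
medians[props] <- lapply(medians[props], function(col) col / max(col))
print(medians)
```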

Plot Two

Description Two

The 4 features with the highest correlation coefficients are alcohol (0.476), volatile acidity (-0.391), sulphates (0.251), and citric acid (0.226). Wine quality is grouped into low (3,4), medium (5,6), and high (7,8). High quality wines have high alcohol levels; however, there is no significant difference between medium and low quality wines. Volatile acidity decreases as wine quality increases. Sulphates and citric acid increase as wine quality increases.

Plot Three

Description Three

The 4 features are also represented in scatter plots. Two features are plotted at a time, with color indicating wine quality. A trend similar to that in the previous figure can be observed.


Reflection

The red wine data set contains 1,599 observations with 11 variables on the chemical properties. We are interested in the correlation between the features and wine quality. Unlike diamond price, which is dominated by size (carat), wine quality is more complex and does not have an obvious driver. Most of the data visualization in this project was done on the 4 features with the highest correlation coefficients: alcohol (0.476), volatile acidity (-0.391), sulphates (0.251), and citric acid (0.226). After some web research, the reflections on these chemical components are as follows.

Surprisingly, other chemical properties, such as residual sugar and pH, do not have a strong correlation with wine quality.

In the end, a linear model with 6 features was created to predict wine quality. However, wine quality is a complex subject. Different grape varieties can largely affect the wine's taste. There are many nuances in taste and aroma that cannot be captured by examining chemical components. The linear model is an overly simplified model. Good wine is more than a perfect combination of different chemical components.